# session info
# -- R version 3.6.3 (2020-02-29)
# libraries
library(dplyr) # Version '0.8.5'
library(ggplot2) # Version '3.3.0'
# library(stringr) # Version '1.4.0'
library(gam) # Version '1.16.1'
library(MASS) # Version '7.3-51.5'
library(DHARMa) # Version '0.3.1'
# DHARMa vignette: https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
library(AER) # Version '1.2-9'
# library(statR) # Version '0.0.0.9000'
# Data set for 2019
miv_2019 <- readr::read_csv("./miv.csv")
# Data set for 2020
miv_2020 <- readr::read_csv("./Mobility_VerkehrsmessstellenKantonZH.csv")
head(miv_train_day)
head(miv_test)
Skewness = -0.61 -> negatively (left-) skewed distribution. The flow data is negatively skewed: we see a large share of relatively high values, but also about one third of time intervals with smaller numbers of vehicles per day. Surprisingly, there are no zeros.
With count data that follows a Poisson distribution, a square root transformation is usually recommended. In this case, however, the square root transformation does not improve the distribution, so we will fit the models on the original flow data.
The square root transformation does not distribute the flow values more evenly. The original data is already more evenly distributed, with outliers primarily at the lower end of the value range.
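The skewness figure above can be reproduced with a few lines of R. This is a minimal sketch on made-up stand-in data (the real miv_train_day is not reproduced here), using the third standardized moment:

```r
# Stand-in for the flow values: roughly two thirds high, one third low
flow <- c(rep(15000, 120), rep(9000, 60))

# Skewness as the third standardized moment
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / sd(x)^3
}

skewness(flow)        # negative -> left-skewed
skewness(sqrt(flow))  # still clearly negative: sqrt does not fix the shape
```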
Correlation is a statistical measure of the degree of linear dependence between two paired variables. It takes values between -1 and +1; a value close to 0 suggests a weak relationship. A low correlation (-0.2 < x < 0.2) suggests that much of the variation in the response variable is unexplained by the predictor, in which case we should probably look for a better predictor.
According to the correlations, most of the variation in flow_daysum is explained by the predictor weekday_num, but we see at least some sort of relationship with all explanatory variables.
As the correlation matrix suggests that all explanatory variables have an influence on the response variable, the linear model will include all of them.
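The correlation screening can be sketched with cor() on the numeric columns. The toy data frame below is illustrative only; the column names mirror the real data, the values are made up:

```r
# Toy stand-in for miv_train_day with a few numeric columns
toy <- data.frame(
  flow_daysum = c(14000, 15500, 9000, 8000, 16000, 7500),
  weekday_num = c(2, 3, 6, 7, 4, 7),   # weekend days get the high numbers
  holiday     = c(0, 0, 0, 1, 0, 1)
)

# Pairwise Pearson correlations, each in [-1, 1]
round(cor(toy), 2)
```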
linearMod <- lm(flow_daysum ~ month + weekday + holiday + schoolholiday, data=miv_train_day)
p-Value: The model p-value and almost all p-values of the individual predictors (apart from some weekdays) are smaller than 0.05. We can conclude that our model is indeed statistically significant: there is a relationship between the response variable and most of the predictors.
Adj. R-squared: 0.8872 -> ~88% of variation in the response variable has been explained by this model.
Accuracy: Besides looking at the adjusted R-squared to judge the model, it is better practice to also consider the AIC and the prediction accuracy on a validation sample.
Goodness-of-fit (AIC): AIC(linearMod) => 3005.332. Rule of thumb: the lower the better.
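The rule of thumb can be illustrated on toy data: the model that actually uses an informative predictor gets the lower AIC (all names below are illustrative):

```r
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 5)

m_full <- lm(y ~ x)   # uses the informative predictor
m_null <- lm(y ~ 1)   # intercept only

AIC(m_full) < AIC(m_null)  # TRUE: lower AIC -> better fit/complexity trade-off
```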
##
## Call:
## lm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday,
## data = miv_train_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3709.7 -411.8 -4.6 510.4 2590.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14374.32 245.11 58.644 < 2e-16 ***
## month 561.09 42.21 13.293 < 2e-16 ***
## weekdayMonday -1342.91 265.20 -5.064 1.06e-06 ***
## weekdaySaturday -3873.14 262.59 -14.750 < 2e-16 ***
## weekdaySunday -5808.48 262.39 -22.136 < 2e-16 ***
## weekdayThursday -22.26 262.00 -0.085 0.932
## weekdayTuesday -247.26 262.02 -0.944 0.347
## weekdayWednesday 23.16 262.41 0.088 0.930
## holiday -5861.86 377.31 -15.536 < 2e-16 ***
## schoolholiday -1555.41 188.86 -8.236 4.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 944.6 on 171 degrees of freedom
## Multiple R-squared: 0.8928, Adjusted R-squared: 0.8872
## F-statistic: 158.3 on 9 and 171 DF, p-value: < 2.2e-16
poissonMod <- glm(flow_daysum ~ month + weekday + holiday + schoolholiday, family="poisson", data = miv_train_day)
##
## Call:
## glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday,
## family = "poisson", data = miv_train_day)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -32.058 -3.600 0.305 4.201 25.765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.5652708 0.0021352 4479.704 <2e-16 ***
## month 0.0393792 0.0003743 105.214 <2e-16 ***
## weekdayMonday -0.0891981 0.0022912 -38.930 <2e-16 ***
## weekdaySaturday -0.2778933 0.0023688 -117.316 <2e-16 ***
## weekdaySunday -0.4484593 0.0024827 -180.637 <2e-16 ***
## weekdayThursday -0.0005565 0.0022066 -0.252 0.801
## weekdayTuesday -0.0187994 0.0022155 -8.486 <2e-16 ***
## weekdayWednesday -0.0015176 0.0022148 -0.685 0.493
## holiday -0.4839485 0.0040583 -119.250 <2e-16 ***
## schoolholiday -0.1200451 0.0017691 -67.858 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 106307 on 180 degrees of freedom
## Residual deviance: 12289 on 171 degrees of freedom
## AIC: 14368
##
## Number of Fisher Scoring iterations: 4
Checking for overdispersion: residual deviance is 12289 for 171 degrees of freedom.
The ratio of deviance to df should be around 1, here it’s 71.86, indicating overdispersion.
A common reason for overdispersion is a misspecified model, meaning the assumptions of the model are not met, hence we cannot trust its output (e.g. p-Values). When overdispersion is detected, one should therefore first search for problems in the model specification (e.g. by plotting residuals against predictors), and only if this doesn’t lead to success, overdispersion corrections such as changes in the distribution should be applied.
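The deviance/df ratio can be checked in one line. The data below is synthetic (negative-binomial draws fitted with a Poisson GLM, so overdispersion is built in); on the real model the same check would be deviance(poissonMod) / df.residual(poissonMod), and AER::dispersiontest(poissonMod) gives a formal test:

```r
set.seed(42)
# Negative-binomial counts are overdispersed relative to Poisson
y <- rnbinom(200, mu = 20, size = 2)
x <- rnorm(200)

m <- glm(y ~ x, family = poisson)

# Rule of thumb: this ratio should be around 1 for a well-specified Poisson
deviance(m) / df.residual(m)  # clearly above 1 here
```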
We see various signals of overdispersion:
QQ plot: S-shaped; both the distribution test and the dispersion test are significant.
Residuals vs. predicted: the quantile fits are spread out too far. We get more residuals near 0 and 1, meaning more residuals lie in the tails of the distribution than would be expected under the fitted model.
We need to adjust for overdispersion.
The quasi-families augment the standard families by adding a dispersion parameter.
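The quasi-Poisson summary below is consistent with a call of this form; the object name quasiMod is our assumption, since the call itself is not shown:

```r
quasiMod <- glm(flow_daysum ~ month + weekday + holiday + schoolholiday,
                family = "quasipoisson", data = miv_train_day)
summary(quasiMod)
```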
##
## Call:
## glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday,
## family = "quasipoisson", data = miv_train_day)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -32.058 -3.600 0.305 4.201 25.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5652708 0.0179871 531.784 < 2e-16 ***
## month 0.0393792 0.0031529 12.490 < 2e-16 ***
## weekdayMonday -0.0891981 0.0193010 -4.621 7.48e-06 ***
## weekdaySaturday -0.2778933 0.0199542 -13.927 < 2e-16 ***
## weekdaySunday -0.4484593 0.0209136 -21.443 < 2e-16 ***
## weekdayThursday -0.0005565 0.0185878 -0.030 0.976
## weekdayTuesday -0.0187994 0.0186629 -1.007 0.315
## weekdayWednesday -0.0015176 0.0186575 -0.081 0.935
## holiday -0.4839485 0.0341866 -14.156 < 2e-16 ***
## schoolholiday -0.1200451 0.0149024 -8.055 1.30e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 70.96227)
##
## Null deviance: 106307 on 180 degrees of freedom
## Residual deviance: 12289 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
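The stepwise-selection trace below is consistent with step() applied to the Poisson model; the exact call is not shown in the source, so this is a reconstruction:

```r
# AIC-based stepwise selection starting from the full Poisson model
step(poissonMod)
```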
## Start: AIC=14367.92
## flow_daysum ~ month + weekday + holiday + schoolholiday
##
## Call: glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday,
## family = "poisson", data = miv_train_day)
##
## Coefficients:
## (Intercept) month weekdayMonday weekdaySaturday
## 9.5652708 0.0393792 -0.0891981 -0.2778933
## weekdaySunday weekdayThursday weekdayTuesday weekdayWednesday
## -0.4484593 -0.0005565 -0.0187994 -0.0015176
## holiday schoolholiday
## -0.4839485 -0.1200451
##
## Degrees of Freedom: 180 Total (i.e. Null); 171 Residual
## Null Deviance: 106300
## Residual Deviance: 12290 AIC: 14370
The negative binomial distribution includes a dispersion parameter.
Already here we see that the ratio of residual deviance to degrees of freedom is near 1 (181.63 / 171 = 1.06) and hence probably fine.
nbMod <- glm.nb(flow_daysum ~ month + weekday + holiday + schoolholiday, data = miv_train_day)
summary(nbMod)
##
## Call:
## glm.nb(formula = flow_daysum ~ month + weekday + holiday + schoolholiday,
## data = miv_train_day, init.theta = 178.1534899, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8771 -0.4259 0.0185 0.4836 3.0326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.5598179 0.0195613 488.712 < 2e-16 ***
## month 0.0409996 0.0033697 12.167 < 2e-16 ***
## weekdayMonday -0.0865983 0.0211618 -4.092 4.27e-05 ***
## weekdaySaturday -0.2777866 0.0209634 -13.251 < 2e-16 ***
## weekdaySunday -0.4505642 0.0209618 -21.495 < 2e-16 ***
## weekdayThursday 0.0001086 0.0208994 0.005 0.996
## weekdayTuesday -0.0215609 0.0209031 -1.031 0.302
## weekdayWednesday -0.0022003 0.0209340 -0.105 0.916
## holiday -0.4919153 0.0302063 -16.285 < 2e-16 ***
## schoolholiday -0.1187551 0.0150871 -7.871 3.51e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(178.1535) family taken to be 1)
##
## Null deviance: 1423.96 on 180 degrees of freedom
## Residual deviance: 181.63 on 171 degrees of freedom
## AIC: 3054.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 178.2
## Std. Err.: 19.0
##
## 2 x log-likelihood: -3032.052
Dispersion test is still significant -> still overdispersion.
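The DHARMa output below is consistent with simulating residuals from the negative-binomial fit; the object name simulationNBMod appears in the test output, but the call itself is our reconstruction:

```r
# Simulate scaled residuals from the fitted NB model, then test dispersion
simulationNBMod <- DHARMa::simulateResiduals(fittedModel = nbMod)
DHARMa::testDispersion(simulationNBMod)
```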
##
## DHARMa nonparametric dispersion test via mean deviance residual fitted
## vs. simulated-refitted
##
## data: simulationNBMod
## dispersion = 0.97995, p-value = 0.016
## alternative hypothesis: two.sided
gamMod <- mgcv::gam(flow_daysum ~ month + weekday + holiday + schoolholiday, data = miv_train_day, method = "REML")
##
## Family: gaussian
## Link function: identity
##
## Formula:
## flow_daysum ~ month + weekday + holiday + schoolholiday
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14374.32 245.11 58.644 < 2e-16 ***
## month 561.09 42.21 13.293 < 2e-16 ***
## weekdayMonday -1342.91 265.20 -5.064 1.06e-06 ***
## weekdaySaturday -3873.14 262.59 -14.750 < 2e-16 ***
## weekdaySunday -5808.48 262.39 -22.136 < 2e-16 ***
## weekdayThursday -22.26 262.00 -0.085 0.932
## weekdayTuesday -247.26 262.02 -0.944 0.347
## weekdayWednesday 23.16 262.41 0.088 0.930
## holiday -5861.86 377.31 -15.536 < 2e-16 ***
## schoolholiday -1555.41 188.86 -8.236 4.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.887 Deviance explained = 89.3%
## -REML = 1431.2 Scale est. = 8.9221e+05 n = 181
##
## Method: REML Optimizer: outer newton
## full convergence after 3 iterations.
## Gradient range [-3.581009e-08,-3.581009e-08]
## (score 1431.182 & scale 892207.3).
## Hessian positive definite, eigenvalue range [85.5,85.5].
## Model rank = 10 / 10
gamMod_cs <- mgcv::gam(flow_daysum ~ s(month, bs = 'cc', k = 6) + weekday + holiday + schoolholiday, data = miv_train_day, method = "REML")
##
## Family: gaussian
## Link function: identity
##
## Formula:
## flow_daysum ~ s(month, bs = "cc", k = 6) + weekday + holiday +
## schoolholiday
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16444.07 265.05 62.042 < 2e-16 ***
## weekdayMonday -1379.52 372.46 -3.704 0.000287 ***
## weekdaySaturday -3811.27 368.74 -10.336 < 2e-16 ***
## weekdaySunday -5785.15 368.51 -15.699 < 2e-16 ***
## weekdayThursday -68.40 367.98 -0.186 0.852755
## weekdayTuesday -310.43 367.94 -0.844 0.400024
## weekdayWednesday -32.26 368.51 -0.088 0.930354
## holiday -5620.81 533.90 -10.528 < 2e-16 ***
## schoolholiday -2056.42 265.12 -7.757 7.62e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(month) 1.632 4 1.082 0.0566 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.778 Deviance explained = 78.9%
## -REML = 1496 Scale est. = 1.7593e+06 n = 181
Looking at the smooth term, we can see that the monthly smoother is picking up the monthly rise and fall of flow.
Looking at the relative magnitudes (i.e. monthly fluctuation vs. long-term trend), we can see how important it is to disentangle the components of the time series.
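The monthly smooth can also be inspected directly; a minimal sketch with mgcv's plot method (select = 1 picks the first smooth term):

```r
# Partial effect of the cyclic monthly smooth, with a confidence band
plot(gamMod_cs, select = 1, shade = TRUE)
```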
##
## Method: REML Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-7.065438e-06,-2.041466e-07]
## (score 1496.047 & scale 1759326).
## Hessian positive definite, eigenvalue range [0.4583535,86.00779].
## Model rank = 13 / 13
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(month) 4.00 1.63 0.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gamMod_3 <- mgcv::gam(flow_daysum ~ s(month, bs = 'cc', k = 6) + s(weekday_num, bs = 'cc', k = 7) + holiday + schoolholiday, data = miv_train_day)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## flow_daysum ~ s(month, bs = "cc", k = 6) + s(weekday_num, bs = "cc",
## k = 7) + holiday + schoolholiday
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14803.9 148.0 100.016 < 2e-16 ***
## holiday -4851.9 709.6 -6.838 1.33e-10 ***
## schoolholiday -2150.6 351.0 -6.126 5.91e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(month) 0.9206 4 0.352 0.21
## s(weekday_num) 4.5307 5 34.569 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.598 Deviance explained = 61.5%
## GCV = 3.3362e+06 Scale est. = 3.1804e+06 n = 181
The cyclical smoother for weekday is picking up the rise and fall of flow over the week.
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 9 iterations.
## The RMS GCV score gradient at convergence was 25.12427 .
## The Hessian was positive definite.
## Model rank = 12 / 12
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(month) 4.000 0.921 0.79 <2e-16 ***
## s(weekday_num) 5.000 4.531 0.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the actual fit of all models:
miv_pred_0 <- data.frame(time = miv_train_day$date,
flow = miv_train_day$flow_daysum,
predicted_values_0 = predict(linearMod, newdata = miv_train_day))
miv_pred <- data.frame(time = miv_test$date,
flow = miv_test$flow_daysum,
predicted_values = predict(linearMod, newdata = miv_test, interval = "confidence"))
miv_pred_2 <- data.frame(time = miv_test$date,
flow = miv_test$flow_daysum,
predicted_values_2 = predict(gamMod, newdata = miv_test))
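A simple way to compare the models on the held-out data is the RMSE of their predictions. The helper below is ours (the source only builds the prediction data frames); the toy call shows the mechanics:

```r
# Root mean squared error of predictions against observed values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(10, 12, 14), c(11, 12, 13))  # sqrt(2/3), approx. 0.816

# On the real frames, e.g.:
# rmse(miv_pred$flow, miv_pred$predicted_values.fit)
# rmse(miv_pred_2$flow, miv_pred_2$predicted_values_2)
```

Note that predict(linearMod, interval = "confidence") returns a matrix, so data.frame() expands it into the columns predicted_values.fit, predicted_values.lwr and predicted_values.upr.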
Changes: replace the month predictor with a continuous time index (time_id) and fit with a Poisson family instead of a Gaussian one.
improved_mod <- mgcv::gam(flow_daysum ~ time_id + weekday + holiday + schoolholiday, data = miv_train_day, family=poisson)
##
## Family: poisson
## Link function: log
##
## Formula:
## flow_daysum ~ time_id + weekday + holiday + schoolholiday
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.578e+00 2.006e-03 4775.341 <2e-16 ***
## time_id 1.372e-03 1.224e-05 112.122 <2e-16 ***
## weekdayMonday -8.887e-02 2.291e-03 -38.789 <2e-16 ***
## weekdaySaturday -2.777e-01 2.369e-03 -117.226 <2e-16 ***
## weekdaySunday -4.494e-01 2.483e-03 -181.020 <2e-16 ***
## weekdayThursday -1.844e-03 2.206e-03 -0.836 0.403
## weekdayTuesday -1.938e-02 2.215e-03 -8.747 <2e-16 ***
## weekdayWednesday -2.345e-03 2.215e-03 -1.059 0.290
## holiday -4.809e-01 4.054e-03 -118.609 <2e-16 ***
## schoolholiday -1.192e-01 1.765e-03 -67.548 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.901 Deviance explained = 89.8%
## UBRE = 58.734 Scale est. = 1 n = 181
##
## Method: UBRE Optimizer: outer newton
## Model required no smoothing parameter selection
## Model rank = 10 / 10
miv_pred_999 <- data.frame(time = miv_test$time_id,
flow = miv_test$flow_daysum,
predicted_values = predict(improved_mod, newdata = miv_test))
head(miv_pred_999)